Speech recognition method and apparatus utilizing segment models

ABSTRACT

A method and apparatus determine the likelihood of a sequence of words based in part on a segment model. The segment model includes trajectory expressions formed as the product of a polynomial matrix and a generation matrix. The likelihood of the sequence of words is based in part on a segment probability derived by subtracting the trajectory expressions from a feature vector matrix that contains a sequence of feature vectors for a segment of speech. Aspects of the method and apparatus also include training the segment model using such a segment probability.

REFERENCE TO RELATED APPLICATIONS

[0001] This application is a Divisional of U.S. Patent Application 09/559,509, filed on Apr. 27, 2000 and entitled SPEECH RECOGNITION METHOD AND APPARATUS UTILIZING SEGMENT MODELS.

BACKGROUND OF THE INVENTION

[0002] The present invention relates to speech recognition. In particular, the present invention relates to the use of segment models to perform speech recognition.

[0003] In speech recognition systems, an input speech signal is converted into words that represent the verbal content of the speech signal. This conversion begins by converting the analog speech signal into a series of digital values. The digital values are then passed through a feature extraction unit, which computes a sequence of feature vectors based on the digital values. Each feature vector is typically multi-dimensional and represents a single frame of the speech signal.

[0004] To identify a most likely sequence of words, the feature vectors are applied to one or more models that have been trained using a training text. Typically, this involves applying the feature vectors to a frame-based acoustic model in which a single frame state is associated with a single feature vector.

[0005] Recently, however, segment models have been introduced that associate multiple feature vectors with a single segment state. The segment models are thought to provide a more accurate model of large-scale transitions in human speech.

[0006] Although current segment models provide improved modeling of large-scale transitions, their training time and recognition time are less than optimum. As such, more efficient segment models are needed.

SUMMARY OF THE INVENTION

[0007] A method and apparatus determine the likelihood of a sequence of words based in part on a segment model. The segment model includes trajectory expressions formed as the product of a generation matrix and a parameter matrix. The likelihood of the sequence of words is based in part on a segment probability. The segment probability is derived in part by matching the trajectory expressions to a feature vector matrix that contains a sequence of feature vectors for a segment of speech.

[0008] Aspects of the method and apparatus also include training the segment model using such a segment probability.

BRIEF DESCRIPTION OF THE DRAWINGS

[0009] FIG. 1 is a plan view of a general computing environment in which one embodiment of the present invention is used.

[0010] FIG. 2 is a block diagram of a speech recognition system of an embodiment of the present invention.

[0011] FIG. 3 is a graph showing a segment model curve and a sequence of feature values.

[0012] FIG. 4 is a graph showing a segment model curve and a fitted curve of the prior art.

[0013] FIG. 5 is a graph of a segment model curve showing a graphical representation of a probability determination under the present invention.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

[0014] FIG. 1 and the related discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented. Although not required, the invention will be described, at least in part, in the general context of computer-executable instructions, such as program modules, being executed by a personal computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

[0015] With reference to FIG. 1, an exemplary system for implementing the invention includes a general purpose computing device in the form of a conventional personal computer 20, including a processing unit (CPU) 21, a system memory 22, and a system bus 23 that couples various system components including the system memory 22 to the processing unit 21. The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory 22 includes read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system (BIOS) 26, containing the basic routine that helps to transfer information between elements within the personal computer 20, such as during start-up, is stored in ROM 24. The personal computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk (not shown), a magnetic disk drive 28 for reading from or writing to removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD ROM or other optical media. The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives and the associated computer-readable media provide nonvolatile storage of computer readable instructions, data structures, program modules and other data for the personal computer 20.

[0016] Although the exemplary environment described herein employs the hard disk, the removable magnetic disk 29 and the removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, random access memories (RAMs), read only memory (ROM), and the like, may also be used in the exemplary operating environment.

[0017] A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the personal computer 20 through local input devices such as a keyboard 40, pointing device 42 and a microphone 43. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus 23, but may be connected by other interfaces, such as a sound card, a parallel port, a game port or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor 47, personal computers may typically include other peripheral output devices, such as a speaker 45 and printers (not shown).

[0018] The personal computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be another personal computer, a hand-held device, a server, a router, a network PC, a peer device or other network node, and typically includes many or all of the elements described above relative to the personal computer 20, although only a memory storage device 50 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide computer network Intranets, and the Internet.

[0019] When used in a LAN networking environment, the personal computer 20 is connected to the local area network 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a network environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage devices. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used. For example, a wireless communication link may be established between one or more portions of the network.

[0020] Although FIG. 1 shows an exemplary environment, the present invention is not limited to a digital-computing environment. In particular, the present invention can be operated on analog devices or mixed signal (analog and digital) devices. Furthermore, the present invention can be implemented on a single integrated circuit, for example, in small vocabulary implementations.

[0021] FIG. 2 provides a more detailed block diagram of modules of the general environment of FIG. 1 that are particularly relevant to the present invention. In FIG. 2, an input speech signal is converted into an electrical signal by a microphone 100, which is connected to an analog-to-digital (A-to-D) converter 102. A-to-D converter 102 converts the analog signal into a series of digital values. In several embodiments, A-to-D converter 102 samples the analog signal at 16 kHz thereby creating 16 kilobytes of speech data per second.

[0022] The digital data created by A-to-D converter 102 is provided to a feature extractor 104 that extracts a feature from the digital speech signal. Examples of feature extraction modules include modules for performing Linear Predictive Coding (LPC), LPC derived cepstrum, Perceptive Linear Prediction (PLP), Auditory model feature extraction, and Mel-Frequency Cepstrum Coefficients (MFCC) feature extraction. Note that the invention is not limited to these feature extraction modules and that other modules may be used within the context of the present invention.

[0023] The feature extraction module receives the stream of digital values from A-to-D converter 102, and produces a stream of feature vectors that are each associated with a frame of the speech signal. In many embodiments, the centers of the frames are separated by 10 milliseconds.
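The framing step that precedes feature extraction can be illustrated with a minimal sketch. The 16 kHz sampling rate and the 10 millisecond spacing come from the description above; the 25 millisecond frame length, the function name frame_signal and the synthetic input are assumptions made only for illustration, and no actual feature (LPC, PLP, MFCC) is computed here.

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, hop_ms=10, frame_ms=25):
    """Split a 1-D signal into overlapping frames whose centers are hop_ms apart.

    Assumes the signal holds at least one full frame. frame_ms is an
    assumed window length; the description only fixes the 10 ms spacing.
    """
    hop = sample_rate * hop_ms // 1000       # 160 samples at 16 kHz
    size = sample_rate * frame_ms // 1000    # 400 samples at 16 kHz
    n_frames = 1 + (len(samples) - size) // hop
    return np.stack([samples[i * hop : i * hop + size] for i in range(n_frames)])

# One second of synthetic "speech" stands in for the A-to-D converter output.
signal = np.random.default_rng(0).standard_normal(16000)
frames = frame_signal(signal)
```

Each row of frames would then be passed to a feature extraction module to produce one feature vector per frame.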

[0024] The stream of feature vectors produced by the extraction module is provided to a decoder 106, which identifies a most likely sequence of words based on the stream of feature vectors, a segment model 111, a language model 110, and a lexicon 112.

[0025] Segment model 111 indicates how likely it is that a sequence of feature vectors would be produced by a segment of a particular duration. The segment model uses multiple feature vectors at the same time to make a determination about the likelihood of a particular segment. Because of this, it provides a good model of large-scale transitions in the speech signal. In addition, the segment model looks at multiple durations for each segment and determines a separate probability for each duration. As such, it provides a more accurate model for segments that have longer durations.

[0026] Language model 110 provides a set of likelihoods that a particular sequence of words will appear in the language of interest. In many embodiments, the language model is based on a text database such as the North American Business News (NAB), which is described in greater detail in a publication entitled CSR-III Text Language Model, University of Penn., 1994. The language model may be a context-free grammar or a statistical N-gram model such as a trigram. In one embodiment, the language model is a compact trigram model that determines the probability of a sequence of words based on the combined probabilities of three-word segments of the sequence.
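As an illustration of how a trigram model combines the probabilities of three-word segments, the following is a minimal sketch using unsmoothed maximum-likelihood counts. The toy corpus, the padding tokens and the function names are assumptions for illustration; the compact trigram model referred to above is not specified in this description and would normally include smoothing or back-off.

```python
import math
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word contexts over padded sentences."""
    tri, ctx = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            ctx[(a, b)] += 1
    return tri, ctx

def sequence_log_prob(words, tri, ctx):
    """Sum log P(w_i | w_{i-2}, w_{i-1}) over the sequence (no smoothing)."""
    padded = ["<s>", "<s>"] + words + ["</s>"]
    total = 0.0
    for a, b, c in zip(padded, padded[1:], padded[2:]):
        if tri[(a, b, c)] == 0:
            return float("-inf")   # unseen trigram under this unsmoothed sketch
        total += math.log(tri[(a, b, c)] / ctx[(a, b)])
    return total

tri, ctx = train_trigram([["the", "cat", "sat"], ["the", "cat", "ran"]])
seen = sequence_log_prob(["the", "cat", "sat"], tri, ctx)
unseen = sequence_log_prob(["the", "dog", "sat"], tri, ctx)
```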

[0027] Based on the segment model, the language model, and the lexicon, decoder 106 identifies a most likely sequence of words from all possible word sequences. The particular method used to select the most probable sequence of words is discussed further below.

[0028] The most probable sequence of hypothesis words is provided to confidence measure module 114. Confidence measure module 114 identifies which words are most likely to have been improperly identified by the speech recognizer, based in part on a frame-based acoustic model. Confidence measure module 114 then provides the sequence of hypothesis words to an output module 126 along with identifiers indicating which words may have been improperly identified. Those skilled in the art will recognize that confidence measure module 114 is not necessary for the practice of the present invention.

[0029] Before segment model 111 may be used to decode a sequence of input feature vectors, it must be trained. In FIG. 2, such training is performed by trainer 140 based on training text 142, past model parameters from segment model 111 and training feature vectors from feature extractor 104. The method of training under the present invention is discussed further below. Those skilled in the art will recognize that a speech recognition system does not need trainer 140 if its models have been previously trained.

[0030] In most embodiments of the present invention, the segment model described above is a parametric trajectory segment model. In one embodiment, the parameters for a trajectory assume the form of polynomials. However, those skilled in the art will appreciate that trajectory expressions can be made using basis functions other than polynomials such as wavelets and sinusoidal functions. Although the description in the following uses polynomials as an example, the present invention may be applied to other trajectory modeling techniques as well.

[0031] In a polynomial trajectory model, each component or dimension of the observed feature vector is modeled by a family of n-order polynomial functions. Thus, if there are twenty components (or dimensions) in a feature vector, there will be a family of twenty n-order polynomial functions. The polynomial function for any one dimension of the feature vectors describes a smooth curve that extends through the frames associated with the feature vectors.

[0032] FIG. 3 provides a graph showing a curve described by a polynomial function for a single dimension of the feature vectors. In FIG. 3, time is shown along horizontal axis 300 and the value of the dth dimension of the feature vector is shown on vertical axis 302.

[0033] In FIG. 3, actual values for the dth component at selected frames are shown as dots. For example, dot 304 shows the actual value of the dth dimension for frame 306. Similarly, dot 308 represents the value of the dth dimension for the feature vector at frame 310. Curve 312 of FIG. 3 represents the segment model curve generated by the polynomial function for a particular segment. Note that FIG. 3 only shows one dimension of the feature vectors and that separate curves are provided for each dimension within the segment model.

[0034] The segment model curve of FIG. 3 can be described mathematically in relation to an observation vector y_(t) from a series of observation vectors Y = (y_(0), y_(1), ..., y_(T−1)) as:

y_(t) = C F_(t) + e_(t)(Σ)  EQ. 1

[0035] Where y_(t) is a D-dimensional feature vector, C is a trajectory parameter matrix, F_(t) is a trajectory generation matrix containing a family of polynomial functions and e_(t)(Σ) is a residual fitting error. Thus, equation 1 can be expanded into full matrix notation as:

$$\begin{pmatrix} y_{t,1} \\ y_{t,2} \\ \vdots \\ y_{t,D} \end{pmatrix} = \begin{pmatrix} c_{1}^{0} & c_{1}^{1} & \cdots & c_{1}^{N} \\ c_{2}^{0} & c_{2}^{1} & \cdots & c_{2}^{N} \\ \vdots & \vdots & \ddots & \vdots \\ c_{D}^{0} & c_{D}^{1} & \cdots & c_{D}^{N} \end{pmatrix} \begin{pmatrix} f_{0}(t) \\ f_{1}(t) \\ \vdots \\ f_{N}(t) \end{pmatrix} + e_{t}(\Sigma) \qquad \text{EQ. 2}$$

[0036] Where y_(t,d) represents the dth dimension of the tth feature vector, c_(d)^(n) represents a weighting parameter for the nth polynomial associated with the dth dimension and f_(n)(t) represents the nth polynomial evaluated at time t. In one embodiment of equation 2, each polynomial function f_(n)(t) is a Legendre polynomial of order n. In another embodiment, f₁(t) is a first order polynomial, f₂(t) is a second order polynomial and so on. To simplify computations, the polynomials are often drawn from a collection of orthogonal functions as mentioned in the previous two embodiments. Those skilled in the art, however, can appreciate that other orthogonal and/or non-orthogonal polynomials can also be used for segment modeling purposes. In most embodiments, the residual error is assumed to be an independent and identically distributed random process with one or a mixture of normal distributions having zero means and covariance matrices Σ.

[0037] Equation 2 can be expanded to describe all of the feature vectors in a segment containing T feature vectors. Thus, equation 2 becomes:

$$\begin{pmatrix} y_{0,1} & y_{1,1} & \cdots & y_{T-1,1} \\ y_{0,2} & y_{1,2} & \cdots & y_{T-1,2} \\ \vdots & \vdots & \ddots & \vdots \\ y_{0,D} & y_{1,D} & \cdots & y_{T-1,D} \end{pmatrix} = \begin{pmatrix} c_{1}^{0} & c_{1}^{1} & \cdots & c_{1}^{N} \\ c_{2}^{0} & c_{2}^{1} & \cdots & c_{2}^{N} \\ \vdots & \vdots & \ddots & \vdots \\ c_{D}^{0} & c_{D}^{1} & \cdots & c_{D}^{N} \end{pmatrix} \begin{pmatrix} f_{0}(0) & \cdots & f_{0}(T-1) \\ f_{1}(0) & \cdots & f_{1}(T-1) \\ \vdots & \ddots & \vdots \\ f_{N}(0) & \cdots & f_{N}(T-1) \end{pmatrix} + \begin{pmatrix} e_{0,1} & \cdots & e_{T-1,1} \\ e_{0,2} & \cdots & e_{T-1,2} \\ \vdots & \ddots & \vdots \\ e_{0,D} & \cdots & e_{T-1,D} \end{pmatrix} \qquad \text{EQ. 3}$$

[0038] Where y_(T−1,1) represents the first dimension of the feature vector at time T−1 and f₀(T−1) represents the zero order polynomial evaluated at time T−1.

[0039] Equation 3 can be represented generally by:

Y_(k) = C_(k) F + E_(k)(Σ)  EQ. 4

[0040] Where Y_(k) represents the feature vector matrix for a segment k, C_(k) represents the trajectory parameter matrix for segment k, F represents the trajectory generation matrix, and E_(k)(Σ) represents an error matrix that is based on a covariance matrix Σ.
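The structure of equations 3 and 4 can be sketched numerically as follows. This is only an illustrative sketch for the Legendre embodiment: the mapping of the frame indices 0..T−1 onto the interval [−1, 1], the function name generation_matrix, and the randomly drawn parameter and error values are assumptions, not details taken from this description.

```python
import numpy as np
from numpy.polynomial import legendre

def generation_matrix(num_frames, order):
    """Return F with shape (order + 1, num_frames); row n holds f_n at each frame."""
    # Assumption: frame indices are mapped linearly onto [-1, 1], the interval
    # on which Legendre polynomials are orthogonal.
    x = np.linspace(-1.0, 1.0, num_frames)
    return legendre.legvander(x, order).T

D, N, T = 3, 2, 5                        # feature dimension, polynomial order, frames
rng = np.random.default_rng(0)
C = rng.standard_normal((D, N + 1))      # trajectory parameter matrix (illustrative values)
F = generation_matrix(T, N)              # trajectory generation matrix
E = 0.1 * rng.standard_normal((D, T))    # residual error matrix (illustrative values)
Y = C @ F + E                            # EQ. 4: one column per feature vector
```

Each row of C @ F is the smooth curve for one dimension of the feature vectors, corresponding to a curve such as curve 312 of FIG. 3.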

[0041] Each segment state m in the model is associated with a word or sub-word unit and is defined by its parameter matrix C_(m) and its covariance matrix Σ_(m), which are referred to more generally as probabilistic parameters. During training of the segment model, the parameter matrix and the covariance matrix are chosen so that the model best matches the feature vectors provided for the respective sub-word or word unit in the training data. A particular method for training a segment model under the present invention is discussed further below.

[0042] Under the prior art, training the segment model involved a curve fitting step where the training feature vectors were used to generate a curve fitting parameter matrix C_(k) and a curve fitting covariance matrix Σ_(k) for each segment k in the training speech signal. Under one system of the prior art, this curve fitting involves calculating the parameter matrix as:

C_(k) = Y_(k) F^(t) [F F^(t)]⁻¹  EQ. 5

[0043] Where C_(k) is the curve fitting parameter matrix for segment k, Y_(k) is the observed sequence of feature vectors for the segment, F is the generation matrix, and the superscript t indicates a transpose. The curve fitting covariance matrix Σ_(k) is calculated under the prior art as:

$$\Sigma_{k} = \frac{\left( Y_{k} - C_{k} F \right) \left( Y_{k} - C_{k} F \right)^{t}}{T_{k}} \qquad \text{EQ. 6}$$

[0044] where Σ_(k) is the curve fitting covariance matrix, Y_(k) is the observed sequence of feature vectors for the current segment, C_(k) is the curve fitting parameter matrix, F is the generation matrix, and T_(k) is the number of feature vectors (duration) being analyzed for the current segment k.
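The prior art curve fitting step of equations 5 and 6 can be sketched as below. The function name fit_segment, the monomial basis and the example values are assumptions chosen for brevity; a linear solve is used in place of the explicit inverse, which gives the same result when F F^(t) is invertible.

```python
import numpy as np

def fit_segment(Y, F):
    """Least-squares curve fit of one D x T segment (EQ. 5 and EQ. 6)."""
    # Solve C (F F^t) = Y F^t without forming an explicit inverse.
    C = np.linalg.solve(F @ F.T, F @ Y.T).T
    residual = Y - C @ F
    Sigma = residual @ residual.T / Y.shape[1]   # divide by the duration T_k
    return C, Sigma

T = 8
t = np.linspace(-1.0, 1.0, T)                    # assumed time normalization
F = np.vstack([np.ones(T), t, t**2])             # illustrative basis, (N + 1) x T
C_true = np.array([[1.0, 2.0, -1.0], [0.5, 0.0, 3.0]])
Y = C_true @ F                                   # noiseless segment for the check
C_fit, Sigma_fit = fit_segment(Y, F)
```

With noiseless data the fit recovers the generating parameters and the fitted covariance is numerically zero. This fit has to be repeated for every segment, which is the cost the present invention avoids.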

[0045] Once the curve fitting matrices have been formed for each of the segments in the training text, the model is constructed. This is typically done using an expectation-maximization algorithm (EM). Under this algorithm, the model is iteratively changed until it becomes stable. Specifically, at iteration i, the model parameter matrix C_(m)^(i) and model covariance matrix Σ_(m)^(i) for the mth segment model are determined as:

$$C_{m}^{i} = \left[ \sum_{k=1}^{K} \gamma_{m|k}^{i} C_{k} F_{T_{k}} F_{T_{k}}^{t} \right] \left[ \sum_{k=1}^{K} \gamma_{m|k}^{i} F_{T_{k}} F_{T_{k}}^{t} \right]^{-1} \qquad \text{EQ. 7}$$

$$\Sigma_{m}^{i} = \frac{\sum_{k=1}^{K} \gamma_{m|k}^{i} \left( C_{k} F_{T_{k}} - C_{m}^{i} F_{T_{k}} \right) \left( C_{k} F_{T_{k}} - C_{m}^{i} F_{T_{k}} \right)^{t}}{\sum_{k=1}^{K} \gamma_{m|k}^{i} T_{k}} \qquad \text{EQ. 8}$$

[0046] Where k is the current input segment, K is the total number of input segments in the training data, m is the current segment model, Y_(k) is the sequence of feature vectors for the current segment k, F_(T_k) is the generation matrix evaluated at the time periods associated with the current segment, the superscript t indicates a transpose function, C_(k) is the curve fitting parameter matrix for segment k, T_(k) is the number of feature vectors for segment k, and γ_(m|k)^(i) is the probability of the mth segment model given the current segment k. This probability is calculated as:

$$\gamma_{m|k}^{i} = \frac{p\left( C_{k}, \Sigma_{k} \mid C_{m}^{i-1}, \Sigma_{m}^{i-1} \right) w_{m}^{i-1}}{\sum_{j=1}^{M} \left[ p\left( C_{k}, \Sigma_{k} \mid C_{j}^{i-1}, \Sigma_{j}^{i-1} \right) w_{j}^{i-1} \right]} \qquad \text{EQ. 9}$$

[0047] Where p(C_(k), Σ_(k)|C_(m)^(i−1), Σ_(m)^(i−1)) is the probability of the curve fitting parameter matrix and curve fitting covariance matrix for the current segment given the model parameter matrix and the model covariance matrix calculated at a previous iteration, i−1, and M is the total number of segment models. In Equation 9, w_(m)^(i−1) is the mixture weight for segment model, m, at the previous iteration. The new mixture weight for the current iteration can be obtained according to:

$$w_{m}^{i} = \frac{1}{K} \sum_{k=1}^{K} \left[ \frac{w_{m}^{i-1} \cdot p\left( C_{k}, \Sigma_{k} \mid C_{m}^{i-1}, \Sigma_{m}^{i-1} \right)}{\sum_{j=1}^{M} \left[ p\left( C_{k}, \Sigma_{k} \mid C_{j}^{i-1}, \Sigma_{j}^{i-1} \right) w_{j}^{i-1} \right]} \right] \qquad \text{EQ. 10}$$

[0048] The denominators shown in equations 9 and 10 provide the total probability of the curve fitting parameter matrix and the curve fitting covariance matrix for the current segment given all of the available segment models of the previous iteration. We could use the following equation to denote this total likelihood.

$$P\left( C_{k}, \Sigma_{k} \right) = \sum_{j=1}^{M} \left[ p\left( C_{k}, \Sigma_{k} \mid C_{j}^{i-1}, \Sigma_{j}^{i-1} \right) w_{j}^{i-1} \right] \qquad \text{EQ. 11}$$

[0049] In equations 9 and 10, the probability of the curve fitting parameter matrix and the curve fitting covariance matrix for the current segment given a model parameter matrix and a model covariance matrix, p(C_(k), Σ_(k)|C_(m), Σ_(m)), is calculated as:

$$p\left( C_{k}, \Sigma_{k} \mid C_{m}, \Sigma_{m} \right) = \frac{\exp\left( -\frac{1}{2} \mathrm{tr}\left[ \left( C_{k} F_{T_{k}} - C_{m} F_{T_{k}} \right)^{t} \Sigma_{m}^{-1} \left( C_{k} F_{T_{k}} - C_{m} F_{T_{k}} \right) \right] - \frac{T_{k}}{2} \mathrm{tr}\left( \Sigma_{m}^{-1} \Sigma_{k} \right) \right)}{\left( 2\pi \right)^{D T_{k}/2} \left| \Sigma_{m} \right|^{T_{k}/2}} \qquad \text{EQ. 12}$$

[0050] Where D is the dimension of the feature vectors, C_(m) is the model parameter matrix and Σ_(m)⁻¹ is the inverse of the model covariance matrix. Note that the iteration marker i has been removed for simplicity since we can use C_(m)^(i) and Σ_(m)^(i) to replace C_(m) and Σ_(m) for each iteration. In equation 12, the superscript t indicates a transpose function.
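For illustration, equation 12 can be evaluated in the log domain as sketched below. The function name and the example values are assumptions; the quadratic term is computed as tr(Σ_(m)⁻¹ A A^(t)) with A = (C_(k) − C_(m)) F, which is the conformable form when the feature vectors are stored as columns as in equation 3.

```python
import numpy as np

def prior_art_log_likelihood(C_k, Sigma_k, C_m, Sigma_m, F):
    """Log of EQ. 12: score the curve fitting matrices against one model."""
    D, T = C_k.shape[0], F.shape[1]
    diff = (C_k - C_m) @ F                          # gap between fitted and model curves
    Sigma_inv = np.linalg.inv(Sigma_m)
    _, logdet = np.linalg.slogdet(Sigma_m)          # log |Sigma_m|
    quad = np.trace(Sigma_inv @ diff @ diff.T)      # curve-to-curve distance term
    fit_error = T * np.trace(Sigma_inv @ Sigma_k)   # curve fitting error term
    return -0.5 * (quad + fit_error + D * T * np.log(2 * np.pi) + T * logdet)

T = 4
t = np.linspace(-1.0, 1.0, T)
F = np.vstack([np.ones(T), t])                      # illustrative first order basis
C = np.array([[0.0, 1.0], [2.0, -1.0]])
identity = np.eye(2)
value = prior_art_log_likelihood(C, identity, C, identity, F)
```

Both inputs C_k and Sigma_k must come from a separate curve fitting pass over the segment, which is why this likelihood cannot be evaluated directly from the feature vectors.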

[0051] Note that in equation 12, the probability is based on the difference being taken between the curve fitting parameter matrix and the model parameter matrix plus the curve fitting error. This means that a curve fitting step must be performed for each segment in the training data before the model can be generated. Because of this, the model training process of the prior art is inefficient.

[0052] Once the model of the prior art is constructed, it can be used to determine the likelihood of an observed segment for a particular sub-word unit. Under the prior art, this likelihood was determined by first performing curve fitting on the observed feature vectors. This resulted in a curve fitting parameter matrix, C_(k), and a curve fitting covariance matrix Σ_(k) for each possible segment in the input speech pattern. Under the prior art, these curve fitting matrices were calculated using equations 5 and 6 above. Once the curve fitting matrices had been calculated, the prior art calculated the likelihood of the curve fitting matrices given the model parameter matrix and the model covariance matrix. This likelihood was calculated using equations 11 and 12 above.

[0053] FIG. 4 depicts a graphical representation of the way in which the prior art determined the probability of the feature vectors of a current segment given a model. In FIG. 4, time is shown along the horizontal axis 400 and the magnitude of a single dimension of the feature vectors is shown along vertical axis 402. FIG. 4 shows two curves. Curve 404 is the model curve generated by the model parameter matrix and model covariance matrix. Curve 406 is the curve generated by a curve fitting parameter matrix and curve fitting covariance matrix for the current segment. Under the prior art, the probability of the current segment given the model was essentially determined by measuring the distances between curves 404 and 406 at the feature vector time marks plus the curve fitting error. Thus, the differences were calculated at time marks 408, 410, 412, 414 and 416. Note that the distances between curve 406 and the individual feature vector values 420, 422, 424, 426, and 428 are measured to form the curve fitting error.

[0054] Thus, under the decoding technique of the prior art, a curve fitting step had to be performed for each possible segment in the input speech signal. Since several different segmentations are possible in a speech signal, a large number of curve fitting operations had to be performed under the prior art. Because these curve fitting operations are time consuming, they cause the prior art detection to be inefficient.

[0055] The present invention overcomes the inefficiencies of training a trajectory segment model and using a trajectory segment model to decode speech signals into sequences of words. The present invention improves the efficiency by removing the step of generating curve fitting matrices for the segments during training and decoding.

[0056] Specifically, under the present invention, the model parameter matrix C_(m)^(i) and the model covariance matrix Σ_(m)^(i) are trained using the following equations:

$$C_{m}^{i} = \left[ \sum_{k=1}^{K} \gamma_{m|k}^{i} Y_{k} F_{T_{k}}^{t} \right] \left[ \sum_{k=1}^{K} \gamma_{m|k}^{i} F_{T_{k}} F_{T_{k}}^{t} \right]^{-1} \qquad \text{EQ. 13}$$

$$\Sigma_{m}^{i} = \frac{\sum_{k=1}^{K} \gamma_{m|k}^{i} \left( Y_{k} - C_{m}^{i} F_{T_{k}} \right) \left( Y_{k} - C_{m}^{i} F_{T_{k}} \right)^{t}}{\sum_{k=1}^{K} \gamma_{m|k}^{i} T_{k}} \qquad \text{EQ. 14}$$

[0057] where i is the training iteration, m is the current model, k is the current segment, K is the total number of segments in the training utterance, F_(T_k) is the trajectory generation matrix evaluated at the current segment, T_(k) is the total number of feature vectors in the current segment, Y_(k) is the sequence of observed feature vectors for the current segment, and γ_(m|k)^(i) is the probability of the mth model given the current segment k. This probability is defined under the present invention as:

$$\gamma_{m|k}^{i} = \frac{p\left( Y_{k} \mid C_{m}^{i-1}, \Sigma_{m}^{i-1} \right) \cdot w_{m}^{i-1}}{\sum_{j=1}^{M} \left[ p\left( Y_{k} \mid C_{j}^{i-1}, \Sigma_{j}^{i-1} \right) w_{j}^{i-1} \right]} \qquad \text{EQ. 15}$$

[0058] where C_(m)^(i−1) and Σ_(m)^(i−1) are the model parameter matrix and model covariance matrix of the previous training iteration, Y_(k) is the sequence of observed feature vectors for the current segment k, and w_(m)^(i−1) is the mixture weight for segment model, m, at the previous iteration. The new mixture weight for the current iteration can be obtained according to:

$$w_{m}^{i} = \frac{1}{K} \sum_{k=1}^{K} \left[ \frac{p\left( Y_{k} \mid C_{m}^{i-1}, \Sigma_{m}^{i-1} \right) \cdot w_{m}^{i-1}}{\sum_{j=1}^{M} \left[ p\left( Y_{k} \mid C_{j}^{i-1}, \Sigma_{j}^{i-1} \right) w_{j}^{i-1} \right]} \right] \qquad \text{EQ. 16}$$

[0059] The denominators shown in equations 15 and 16 provide the total probability of the feature vectors for the current segment given all of the available segment models of the previous iteration. We could use the following equation to denote this total likelihood.

$$P\left( Y_{k} \right) = \sum_{j=1}^{M} \left[ p\left( Y_{k} \mid C_{j}^{i-1}, \Sigma_{j}^{i-1} \right) w_{j}^{i-1} \right] \qquad \text{EQ. 17}$$
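Equations 15 through 17 can be sketched together, since they share the same numerator and denominator. The sketch below assumes the per-segment log likelihoods log p(Y_(k)|C_(m), Σ_(m)) have already been computed and stored in an array; the function name, the array layout and the example numbers are illustrative assumptions. Working with logarithms is a common numerical precaution and is not stated in this description.

```python
import numpy as np

def posteriors_and_weights(log_lik, weights):
    """log_lik[k, m] = log p(Y_k | C_m, Sigma_m); weights[m] = previous mixture weight."""
    joint = log_lik + np.log(weights)                 # log numerator of EQ. 15
    peak = joint.max(axis=1, keepdims=True)           # subtract the peak for stability
    log_total = peak + np.log(np.exp(joint - peak).sum(axis=1, keepdims=True))  # EQ. 17
    gamma = np.exp(joint - log_total)                 # EQ. 15
    new_weights = gamma.mean(axis=0)                  # EQ. 16: average over the K segments
    return gamma, new_weights, log_total.ravel()

# Two segments scored against two models (illustrative likelihood values).
log_lik = np.log(np.array([[0.2, 0.1], [0.05, 0.3]]))
gamma, new_weights, log_total = posteriors_and_weights(log_lik, np.array([0.5, 0.5]))
```

Each row of gamma sums to one, and the new mixture weights sum to one across the models.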

[0060] In equations 15 and 16, the probability of the observed sequence of feature vectors Y_(k), given the model parameter matrix and the model covariance matrix, p(Y_(k)|C_(m), Σ_(m)), is calculated under one embodiment of the present invention using a Gaussian distribution of:

$$p\left( Y_{k} \mid C_{m}, \Sigma_{m} \right) = \frac{\exp\left( -\frac{1}{2} \mathrm{tr}\left[ \left( Y_{k} - C_{m} F_{T_{k}} \right)^{t} \Sigma_{m}^{-1} \left( Y_{k} - C_{m} F_{T_{k}} \right) \right] \right)}{\left( 2\pi \right)^{D T_{k}/2} \left| \Sigma_{m} \right|^{T_{k}/2}} \qquad \text{EQ. 18}$$

[0061] where Y_(k) is a sequence of T_(k) feature vectors of dimension D for a segment k, C_(m) is a model parameter matrix for a segment model state m, F_(T_k) is a model generation matrix evaluated at the time periods associated with the feature vectors, Σ_(m)⁻¹ is the inverse of the covariance matrix for segment model state m, and the superscript t represents the transpose function.
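A minimal sketch of equation 18 in the log domain follows. The function name, the monomial basis and the example values are assumptions. As a sanity check only, the sketch also evaluates the two-step route of equations 5, 6 and 12 on the same data: when the curve fit is the least-squares fit of equation 5, the fit residual is orthogonal to the rows of F, so the two expressions should agree numerically even though the direct route never forms C_(k) or Σ_(k).

```python
import numpy as np

def segment_log_likelihood(Y, C_m, Sigma_m, F):
    """Log of EQ. 18 for a D x T segment Y scored against one model state."""
    D, T = Y.shape
    residual = Y - C_m @ F                        # distance from the model curve
    _, logdet = np.linalg.slogdet(Sigma_m)
    # Sum over frames of r_t^t Sigma_m^{-1} r_t, which equals the trace term.
    quad = np.sum(residual * np.linalg.solve(Sigma_m, residual))
    return -0.5 * (quad + D * T * np.log(2 * np.pi) + T * logdet)

rng = np.random.default_rng(1)
D, T = 2, 6
t = np.linspace(-1.0, 1.0, T)                     # assumed time normalization
F = np.vstack([np.ones(T), t, t**2])              # illustrative basis
C_m = rng.standard_normal((D, 3))
Sigma_m = np.array([[1.0, 0.2], [0.2, 0.5]])
Y = C_m @ F + 0.3 * rng.standard_normal((D, T))
direct = segment_log_likelihood(Y, C_m, Sigma_m, F)

# Two-step route (EQ. 5, 6 and 12), computed here for comparison only.
C_k = np.linalg.solve(F @ F.T, F @ Y.T).T
Sigma_k = (Y - C_k @ F) @ (Y - C_k @ F).T / T
Sigma_inv = np.linalg.inv(Sigma_m)
A = (C_k - C_m) @ F
two_step = -0.5 * (np.trace(Sigma_inv @ A @ A.T) + T * np.trace(Sigma_inv @ Sigma_k)
                   + D * T * np.log(2 * np.pi) + T * np.linalg.slogdet(Sigma_m)[1])
```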

[0062] Note that equations 13, 14, 16 and 18 do not require a curve fitting parameter matrix or a curve fitting covariance matrix. Instead, the model parameter matrix and model covariance matrix are constructed simply by comparing the observation vectors directly to the previous model. The difference between this approach and the approach of the prior art can be seen by comparing FIG. 4 to FIG. 5 of the present invention. In FIG. 4, the model was trained based on a comparison between a curve associated with the curve fitting matrices and a curve associated with the model matrices plus the curve fitting error. In FIG. 5, however, the model is trained by comparing a curve based on the model matrices of the previous iteration with the actual feature vectors. Specifically, in FIG. 5, a curve 500 based on the model matrices is shown for a single dimension of the feature vectors. In FIG. 5, time is shown along horizontal axis 502 and the magnitude of the feature vectors' dimensional component is shown along vertical axis 504. The values of the dimensional component for the sequence of feature vectors are shown as dots 506, 508, 510, 512 and 514.

[0063] Because the present invention does not need to perform curve fitting while generating the model, it requires less time to generate the model.

[0064] The present invention also does not require curve fitting during segment decoding. Under the present invention, the probability of a sequence of feature vectors is calculated directly from the model parameter matrix and the model covariance matrix using equations 17 and 18 above. Thus, for each segment, the probability of a sequence of observed feature vectors is determined based on the proximity of those feature vectors to a curve formed from the model parameter matrix and the model covariance matrix. Since the present invention does not require the generation of curve fitting matrices as in the prior art, the decoding process under the present invention is much more efficient.

[0065] Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.

What is claimed is:
 1. A speech recognition system for identifying words from a series of feature vectors representing speech, the system comprising: a segment model capable of providing a trajectory expression for each of a set of segment states; and a decoder capable of generating a path score that is indicative of the probability that a sequence of words is represented by the series of feature vectors, the path score being based on a feature probability that is determined in part based on differences between a sequence of feature vectors and a segment state's trajectory expression.
 2. The speech recognition system of claim 1 wherein the segment model comprises a set of probabilistic parameters and wherein the feature probability represents the probability of the sequence of feature vectors given the probabilistic parameters.
 3. The speech recognition system of claim 2 wherein the probabilistic parameters comprise a trajectory parameter matrix and a covariance matrix.
 4. The speech recognition system of claim 2 further comprising a trainer for training the probabilistic parameters for each segment state.
 5. The speech recognition system of claim 4 wherein the trainer adaptively trains the probabilistic parameters based on a probability that is determined in part by taking the difference between a sequence of training feature vectors and the trajectory expression provided by the segment model.
 6. The speech recognition system of claim 5 wherein the probabilistic parameters comprise a trajectory parameter matrix that is adaptively trained according to: $C_{m}^{i} = \left\lbrack \sum\limits_{k = 1}^{K}\gamma_{m|k}^{i}Y_{k}F_{T_{k}}^{t} \right\rbrack\left\lbrack \sum\limits_{k = 1}^{K}\gamma_{m|k}^{i}Y_{k}F_{T_{k}}^{t} \right\rbrack^{- 1}$

where i is a training iteration, m is current segment state, C_(m) ^(i) is the trajectory parameter matrix for the segment state m calculated at training iteration i, k is a current segment of a training utterance, K is a total number of segments in the training utterance, F_(T) _(k) is a trajectory generation matrix for T_(k) feature vectors, T_(k) is the total number of training feature vectors in the current segment, Y_(k) is the current sequence of training feature vectors in the current segment, superscript t represents a transpose function, and γ_(m|k) ^(i) is the probability of the mth model given the current segment k.
 7. The speech recognition system of claim 6 wherein γ_(m|k) ^(i) is calculated based in part on a feature probability that provides the likelihood of the current sequence of training feature vectors given a segment model of a previous iteration, the feature probability calculated as: $p\left( Y_{k} \middle| C_{m}^{i - 1},\Sigma_{m}^{i - 1} \right) = \frac{\exp\left( - \frac{1}{2}\operatorname{tr}\left\lbrack \left( Y_{k} - C_{m}^{i - 1}F_{T_{k}} \right)\left( \Sigma_{m}^{i - 1} \right)^{- 1}\left( Y_{k} - C_{m}^{i - 1}F_{T_{k}} \right)^{t} \right\rbrack \right)}{\left( 2\pi \right)^{DT_{k}/2}\left| \Sigma_{m}^{i - 1} \right|^{T_{k}/2}}$

where p(Y_(k)|C_(m) ^(i−1), Σ_(m) ^(i−1)) is the feature probability, Y_(k) is the sequence of T_(k) training feature vectors of dimension D for current segment k, C_(m) ^(i−1) is the trajectory parameter matrix for segment state m for the previous training iteration i−1, Σ_(m) ^(i−1) is the covariance matrix for segment state m for the previous training iteration i−1, (Σ_(m) ^(i−1))⁻¹ is the inverse of the covariance matrix for segment state m for the previous training iteration i−1, F_(T) _(k) is a trajectory generation matrix for T_(k) feature vectors, and the superscript t represents the transpose function.
 8. A method of speech recognition comprising: accessing a segment model's description of a curve for a segment of speech; determining differences between the curve and input feature vectors associated with the segment of speech; using the differences to determine a segment probability that describes the likelihood of the input feature vectors given the segment model; and identifying a most likely sequence of hypothesized words based in part on the segment probability.
 9. The method of claim 8 further comprising training the segment model through an iterative process that trains a present iteration's segment model in part by determining differences between training feature vectors and a curve described by a previous iteration's segment model.
 10. The method of claim 9 wherein determining differences between the training feature vectors and a curve described by a previous iteration's segment model comprises multiplying a parameter matrix of a previous iteration by a generation matrix to produce a product and subtracting the product from a matrix containing the training feature vectors.
 11. The method of claim 10 wherein training the present iteration's segment model further comprises determining the present iteration's parameter matrix through the calculation of: $C_{m}^{i} = \left\lbrack \sum\limits_{k = 1}^{K}\gamma_{m|k}^{i}Y_{k}F_{T_{k}}^{t} \right\rbrack\left\lbrack \sum\limits_{k = 1}^{K}\gamma_{m|k}^{i}Y_{k}F_{T_{k}}^{t} \right\rbrack^{- 1}$

where i is the present training iteration, m is current segment state, C_(m) ^(i) is the parameter matrix for the segment state m calculated at the present training iteration i, k is a current segment of a training utterance, K is a total number of segments in the training utterance, F_(T) _(k) is the generation matrix for T_(k) feature vectors, T_(k) is the total number of training feature vectors in the current segment, Y_(k) is the current sequence of training feature vectors in the current segment, superscript t represents a transpose function, and γ_(m|k) ^(i) is the probability of the mth model given the current segment k.
 12. The method of claim 11 wherein γ_(m|k) ^(i) is calculated based in part on a feature probability that provides the likelihood of the current sequence of training feature vectors given a parameter matrix of a previous iteration and a covariance matrix of a previous iteration, the feature probability calculated as: $p\left( Y_{k} \middle| C_{m}^{i - 1},\Sigma_{m}^{i - 1} \right) = \frac{\exp\left( - \frac{1}{2}\operatorname{tr}\left\lbrack \left( Y_{k} - C_{m}^{i - 1}F_{T_{k}} \right)\left( \Sigma_{m}^{i - 1} \right)^{- 1}\left( Y_{k} - C_{m}^{i - 1}F_{T_{k}} \right)^{t} \right\rbrack \right)}{\left( 2\pi \right)^{DT_{k}/2}\left| \Sigma_{m}^{i - 1} \right|^{T_{k}/2}}$

where p(Y_(k)|C_(m) ^(i−1), Σ_(m) ^(i−1)) is the feature probability, Y_(k) is the sequence of T_(k) training feature vectors of dimension D for current segment k, C_(m) ^(i−1) is the parameter matrix for segment state m for the previous training iteration i−1, Σ_(m) ^(i−1) is the covariance matrix for segment state m for the previous training iteration i−1, (Σ_(m) ^(i−1))⁻¹ is the inverse of the covariance matrix for segment state m for the previous training iteration i−1, F_(T) _(k) is the generation matrix for T_(k) feature vectors, and the superscript t represents the transpose function.
 13. A method of training a speech recognition system using training feature vectors generated from a training speech signal, the method comprising: segmenting the training feature vectors into segments aligned with units of training text; determining differences between training feature vectors of a segment and a curve defined by a segment model associated with the segment's unit of text; and using the differences to determine a revised segment model for the unit of text.
 14. A computer-readable medium having computer-executable components for performing steps comprising: evaluating trajectory expressions at selected frames of a speech signal, the trajectory expressions representing a segment model for a speech recognition system; determining differences between the evaluated trajectory expressions and feature vectors generated from a speech signal; using the differences to determine a segment probability that describes the likelihood of the feature vectors given the segment model; and identifying the likelihood of a sequence of words being present in the speech signal based in part on the segment probability.